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Gaussian Hierarchical Model 

Linxiao Yang, Jun Fang, Hong Cheng, and Hongbin Li, Senior Member, IEEE 


Abstract —We consider a dictionary learning problem whose 
objective is to design a dictionary such that the signals admits a 
sparse or an approximate sparse representation over the learned 
dictionary. Such a problem finds a variety of applications such as 
image denoising, feature extraction, etc. In this paper, we propose 
a new hierarchical Bayesian model for dictionary learning, in 
which a Gaussian-inverse Gamma hierarchical prior is used to 
promote the sparsity of the representation. Suitable priors are 
also placed on the dictionary and the noise variance such that 
they can be reasonably inferred from the data. Based on the 
hierarchical model, a variational Bayesian method and a Gibbs 
sampling method are developed for Bayesian inference. The 
proposed methods have the advantage that they do not require 
the knowledge of the noise variance a priori. Numerical results 
show that the proposed methods are able to learn the dictionary 
with an accuracy better than existing methods, particularly for 
the case where there is a limited number of training signals. 

Index Terms —Dictionary learning, Gaussian-inverse Gamma 
prior, variational Bayesian, Gibbs sampling. 

1. Introduction 

Sparse representation has been of significant interest over 
past few years and has found a variety of applications in 
practice as many natural signals admit a sparse or an approxi¬ 
mate sparse representation in a certain basis [l]-[3]. In many 
applications such as image denoising and interpolation, signals 
are assumed to admit a sparse representation over a pre¬ 
specified non-adaptive dictionary, e.g. discrete consine/wavelet 
transform (DCT/DWT) bases. Nevertheless, recent research 

[4] , [5] has shown that the recovery, denoising and classifi¬ 
cation performance can be considerably improved by utilizing 
an adaptive dictionary that is learned from the training signals 

[5] , [6]. This has inspired studies on dictionary learning whose 
objective is to design overcompelete dictionaries that can bet¬ 
ter represent the signals. A number of algorithms, such as K- 
singular value decomposition (K-SVD) [4], method of optimal 
directions (MOD) [7], dictionary learning with the majoriza- 
tion method [8], and simultaneous codeword optimization 
(SimCO) [9], were developed for learning overcomplete dictio¬ 
naries for sparse representation. Most algorithms formulate the 
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dictionary learning as an optimization problem and solve it via 
a two-stage iterative process, namely, a sparse coding stage and 
a dictionary update stage. The main difference between these 
algorithms lies in the dictionary update stage. Specifically, the 
MOD method [7] updates the dictionary via solving a least 
square problem which admits a closed-form for the dictionary 
update. The K-SVD algorithm [4], instead, updates atoms 
of the dictionary in a sequential manner and while updating 
each atom, the atom is updated along with the nonzero 
entries in the corresponding row vector of the sparse matrix. 
The idea of this sequential atom update was later extended 
to sequentially updating multiple atoms each time [9], and 
recently was generalized to parallel atom-updating in order 
to further accelerate the convergence of the iterative process 
[10]. These methods [4], [7]-[10], although delivering state- 
of-the-art performance, require the knowledge of the sparsity 
level or the noise/residual variance to define the stopping 
criterion for estimating the sparse codes (e.g. [4]), or select 
appropriate values for the regularization parameters controlling 
the tradeoff between the sparsity level and the data fitting error 
(e.g. [8], [10]). In practice, however, the prior information 
about the noise variance is usually unavailable and an inaccu¬ 
rate estimation may result in substantial performance degra¬ 
dation. To mitigate this limitation, a nonparametric Bayesian 
dictionary learning method called as beta-Bernoulli process 
factor analysis (BPFA) was recently developed in [11]. The 
proposed method is able to automatically infer the required 
number of factors (dictionary elements) and the noise variance 
from the image under test, which is deemed as an important 
advantage over other dictionary learning methods. For [11], 
the posterior distributions cannot be derived analytically, and 
a Gibbs sampler was used for Bayesian inference. We also 
note that a class of online dictionary learning algorithms 
were developed in [12]-[14]. Different from the above batch- 
based algorithms [4], [7], [9], [10] which use the whole set 
of training data for dictionary learning, online algorithms 
continuously update the dictionary using only one or a small 
batch of training data, which enables them to handle very large 
data sets. 

In this paper, we propose a new hierarchical Bayesian 
model for dictionary learning, in which a Gaussian-inverse 
Gamma hierarchical prior is used to promote the sparsity of the 
representation. Suitable priors are also placed on the dictionary 
and the noise variance such that they can be reasonably 
inferred from the data. Based on the hierarchical model, a 
variational Bayesian method and a Gibbs sampling method are 
developed for Bayesian inference. For both inference methods, 
there are two different ways to update the dictionary: we 
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can update the whole set of atoms at once, or update the 
atoms in a sequential manner. When updating the dictionary 
as a whole, the proposed variational Bayesian method has a 
dictionary update formula similar to the MOD method. Never¬ 
theless, unlike the MOD method which alternates between two 
separate stages (i.e. dictionary update and sparse coding), for 
our algorithm, the dictionary and the signal are refined in an 
interweaved and gradual manner, which enables the algorithm 
to come to a reasonably nearby point as the optimization 
progresses, and helps avoid undesirable local minima. For the 
Gibbs sampler, a sequential update seems able to expedite 
the convergence rate and helps achieve better performance. 
Simulation results show that the proposed Gibbs sampling 
algorithm presents uniform superiority over other state-of-the- 
art dictionary learning methods in a number of experiments. 

The rest of the paper is organized as follows. In Section [III 
we introduce a hierarchical prior model for learning dictionar¬ 
ies. Based on this hierarchical model, a variational Bayesian 
method and a Gibbs sampler are developed in Section [III| 
and Section [IV| for Bayesian inference. Simulation results are 
provided in Section [Vl followed by concluding remarks in 
Section 

II. Hierarchical Model 

Suppose we have L training signals where yi G 

. Dictionary learning aims at finding a common sparsifying 
dictionary D G such that these L training signals 

admit a sparse representation over the overcomplete dictionary 
D, i.e. 

yi=Dxi^wi V/ (1) 

where xi and wi denote the sparse vector and the resid¬ 
ual/noise vector, respectively. Define Y = [y^ ... X = 
[xi ... xl], 2ind W = [wi ... iul], the model ([T]) can be 
re-expressed as 

Y = DX + W (2) 

Also, we write D = [di ... d^], where each column of the 
dictionary, is called an atom. 

In the following, we develop a Bayesian framework for 
learning the overcomplete dictionary and sparse vectors. To 
promote sparse representations, we assign a two-layer hierar¬ 
chical Gaussian-inverse Gamma prior to X. The Gaussian- 
inverse Gamma prior is one of the most popular sparse- 
promoting priors which has been widely used in compressed 
sensing [15]-[17]. In the first layer, X is assigned a Gaussian 
prior distribution 

N L 

p{X\a) = jq Y[pixni) 

n=l 1=1 
N L 

n=l 1=1 

where Xni denotes the (n, /)th entry of X, and a = {ani} are 
non-negative sparsity-controlling hyperparameters. The second 



Fig. 1. Hierarchical model for dictionary learning. 

layer specifies Gamma distributions as hyperpriors over the 
hyperparameters {a^i}, i.e. 

N L 

p{cx) = nn Gamma(ctnz|a, b) 

n=l 1=1 

N L 

( 4 ) 

n=l 1=1 

where r(a) = is the Gamma function, and 

the parameters a and b used to characterize the Gamma 
distribution are usually chosen to be small values, e.g. 10 “^. 
As discussed in [18], this hyperprior allows the posterior mean 
of aril to become arbitrarily large. As a consequence, the 
associated coefficient Xni will be driven to zero, thus yielding 
a sparse solution. In this paper, we choose a value of a = 0.5 
in order to achieve a more sparsity-encouraging effect. Clearly, 
the Gamma prior with a larger a encourages large values of 
the hyperparameters, and therefore promotes the sparseness of 
the solution since the larger the hyperparameter, the smaller 
the variance of the corresponding coefficient. 

In addition, in order to prevent the dictionary from becom¬ 
ing infinitely large, we assume the atoms of the dictionary 
{dn} are mutually independent and each atom is placed a 
Gaussian prior, i.e. 

N N 

p{D)= n A/" {dfi 1 0 , /3T) (5) 

n=l n=l 

where /3 is a parameter whose choice will be discussed 
later. The noise {wi} are assumed independent multivariate 
Gaussian noise with zero mean and covariance matrix (I/ 7 )/, 
where the noise variance I /7 is assumed unknown a priori. 
To estimate the noise variance, we place a Gamma hyperprior 
over 7 , i.e. 

^( 7 ) = Gamma( 7 |c, d) = V{c)~^ ( 6 ) 

where we set c = 0.5 and d = 10“^. The proposed hierarchical 
model (see Fig. [T]) provides a general framework for learning 
the overcomplete dictionary, the sparse codes, as well as the 
noise variance. In the following, we will develop a variational 
Beyesian method and a Gibbs sampling method for Bayesian 
inference. 

III. Variational Inference 
A. Review of The Variational Bayesian Methodology 

Before proceeding, we firstly provide a brief review of the 
variational Bayesian methodology. In a probabilistic model, let 
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y and 0 denote the observed data and the hidden variables, 
respectively. It is straightforward to show that the marginal 
probability of the observed data can be decomposed into two 


terms 

Inp(y) = L{q) + KL{q\\p) 

(7) 

where 

L{q) = j q{d)\n^-i^dd 

(8) 

and 

(9) 

mq\\p) = - j q(.e)\n^i^de 


where q{9) is any probability density function, KL(g| \p) is the 
Kullback-Leibler divergence between p{6\y) and q{6). Since 
KL(g||p) > 0, it follows that L{q) is a rigorous lower bound 
on Inp{y). Moreover, notice that the left hand side of (|7]) is 
independent of q{0). Therefore maximizing L{q) is equivalent 
to minimizing KL{q\\p), and thus the posterior distribution 
p{6\y) can be approximated by q{6) through maximizing 
L{q). 

The significance of the above transformation is that it cir¬ 
cumvents the difficulty of computing the posterior probability 
p{9\y) (which is usually computationally intractable). For a 
suitable choice for the distribution q{9), the quantity L{q) 
may be more amiable to compute. Specifically, we could 
assume some specific parameterized functional form for q{9) 
and then maximize L{q) with respect to the parameters of 
the distribution. A particular form of q{9) that has been 
widely used with great success is the factorized form over the 
component variables {Oi} in 9 [19], i.e. q{9) = Yl- qi{0i). We 
therefore can compute the posterior distribution approximation 
by finding q{9) of the factorized form that maximizes the 
lower bound L{q). The maximization can be conducted in an 
alternating fashion for each latent variable, which leads to [19] 


^ exp{{\np{y,e))k^i) 

J exp{{\np{t,d))k^i)d0i 


( 10 ) 


where {')k^i denotes an expectation with respect to the 
distributions qi{0i) for ail k ^ i. 


B. Proposed Variational Bayesian Method 

We now proceed to perform variational Bayesian inference 
for the proposed hierarchical model. Let 9 = {X,a,D, 7 } 
denote all hidden variables. We assume posterior independence 
among the variables X, cx, D and 7 , i.e. 

p{e\y) Mq{x,a,D,j) 

=qx{x)qa{a)qd{D)q^{'y) ( 11 ) 

With this mean field approximation, the posterior distribution 
of each hidden variable can be computed by maximizing L{q) 
while keeping other variables fixed using their most recent 
distributions, which gives 

lnq^{X) ={\np{Y,X,D, a, ^))q^(^D)q^(a)q^i-y)+constant 
In qd{D) ={lnp{Y,X,D,a,^))q^(^x)qo.{a)q^{'y) + constant 
In qa{a) ={\np{Y,X,D,a,^))g^(^x)q^(D)q^{'y) + constant 
In 57 ( 7 ) =(lnp(l^, a, 7))fe(A:)g<i(0)9o(«) + constant 


where {)qi{■)...qKi-) denotes the expectation with respect to 
(w.r.t.) the distributions {qk{')}k=i - summary, the posterior 
distribution approximations are computed in an alternating 
fashion for each hidden variable, with other variables fixed. 
Details of this Bayesian inference scheme are provided below. 

1). Update of qx{X): The calculation of qx{X) can be 
decomposed into a set of independent tasks, with each task 
computing the posterior distribution approximation for each 
column of X, i.e. qx{xi). We have 

\nq^{xi) (x{ln\p{yi\D,Xi,j)p{xi\cxi)])g^t^D)q^(oc)q^(j) 

( 12 ) 


where ai = {q;„;}7i sparsity-controlling hyperpa- 

rameters associated with xu p{yi\D^xi^y) and p{xi\oLi) are 
respectively given by 


M 

P{yi\D,xu'y) = " exp 


iWVi - Dx 


l \\2 


N 


p{xi\ai) = W^M{xni\0,aJ;) 


(13) 


n=l 


Substituting ([13]) into (fT^ and after some simplifications, 
it can be readily verified that qx{xi) follows a Gaussian 
distribution 


qx{xi)=M{xi\yf,i:^) (14) 

with its mean pif and covariance matrix Yf given respectively 
as 

nf =(7)Sr(D)^y, 

Sf = (( 7 )(r>^r>) + (A,))”' (15) 

where ( 7 ) denotes the expectation w.r.t. ^ 7 ( 7 ), {D) and 
(D'^D) denote the expectation w.r.t. qd{D), and (A/) = 
diag((Q(i/),..., (ofAT/)), in which (ani) represents the expec¬ 
tation w.r.t. qa{oL). 

2). Update of qd{D): The approximate posterior qd{D) can 
be obtained as 

lnqd{D) <x{ln\p{Y\X,D,‘y)p{D)])g^^x)g^(j) 

N 

x{-j\\Y - DX\\l - /3-^Y.^ndn) 

n=l 

cx(- 7 tr{(l" - DX){Y - DXf} - l3-hr{DD'^}) 
(x{tr{D{-/XX'^ + p-n)D^ - 2'yYX'^D'^}) 

=tr{D{{^){XX^) + r^I)D^ - 2 ( 7 )F(X)^£>^} 

(16) 

where for simplicity, we have dropped the subscript of the (•) 
operator. Define 

A4((7)(XX^)+/3-1/)-i 

B Hj)Y{Xf 

The posterior qd{D) can be further expressed as 
\nqd{D) (xiv{DA-^D^ - 2BD^} 

M 

= y2idm-A-^d^. - 


2bm.<C.) (17) 
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where bm- and dm- represents the mth row of B and D, 
respectively. It can be easily seen from (Hii) that the posterior 
distribution qd{D) has independent rows and each row follows 
a Gaussian distribution with its mean and covariance matrix 
given by 6 ^. A and A, respectively, i.e. 

M M 

qd{D) = P[ p{dra.) = n (18) 

m=l m=l 

3) , Update of qa{cx)\ The variational optimization of ^^(a) 
yields 

Inq^ia) cx(lnp(X|a)p(a)),^(X) 

N L 

=EEa T^p{xni\ani)p{anr,a,b)) 

n=l 1=1 ^ ^ 

(19) 

Thus OL has a form of a product of Gamma distributions 

N L 

qa{a) = nn Gamma(a„; ;a,bni) (20) 

n=l 1=1 

in which the parameters d and bni are respectively given as 
d = a+- bni = b ^-{xl^i) (21) 

4) . Update of qj{j)\ The variational optimization of ^ 7 ( 7 ) 
yields 

In9^(7) <x{\np{Y\D, X,j)p{j))g^mQAX) 

L 

cx(ln P[p(y;|D, a;;, 7)^(7)) 

1 = 1 

ln7 - I E(y' “ 

^ 1=1 

+ (c — 1 ) In 7 — dj) 

= (^ + c - 1 ) ln 7 - - DX\\%) +dy 

( 22 ) 

Therefore q^ij) follows a Gamma distribution 

^^ 7 ( 7 ) = Gamma( 7 |c, d) (23) 

with the parameters c and d given respectively by 

. ML 
c=— + c 

d=d+]^{\\Y -DX\\l) (24) 

where 

(||l^ - DX\\l) =(tr{(r - DXf{Y - DX)}) 

= \\Y-{D){X)\\%+iv{{D^D){XX^)] 
-ix{{D'^){D){X){X^)} (25) 

In summary, the variational Bayesian inference involves 
updates of the approximate posterior distributions for hidden 


variables X, D, a, and 7 . Some of the expectations and 
moments used during the update are summarized as 

{XX^) ={x){xf + 

1=1 

(D) =BA 

{D^D) ={Df{D) + M{A) 

(p^ni) —djbfii 

( 7 ) =c/d 

where in (a), ljf[n] denotes the nth entry of fjf, E^[n,n] 
represents the nth diagonal element of and ( 6 ) follows 
from ([18]). For clarity, we summarize our algorithm as follows. 

Sparse Bayesian Dictionary Learning - A Variational 
Bayesian Algorithm 

1. Given the current posterior distributions qd{D), qa{cx) 
and ^ 7 ( 7 ), update the posterior distribution qx{X) 
according to (O. 

2. Given qx{X), qa{cy), and ^ 7 ( 7 ), update qd{D) ac¬ 
cording to (dHi). 

3 . Given qx{X), qd{D) and ^^ 7 ( 7 ), update qa{ct) ac¬ 
cording to (l2Qb . 

4. Given qx{X), qd{D) and qa{oL), update ^ 7 ( 7 ) ac¬ 
cording to (1231) . 

4. Repeat the above steps until a stopping criterion is 
reached. 

Remarks: We discuss the choice of the parameter /3 which 
defines the variance of the dictionary atoms. We might like to 
set (3 equal to 1 /m such that the norm of each atom has unit 
variance. Our experiment results, however, suggest that a very 
large value of /3, e.g. 10^, leads to better performance. In fact, 
choosing an infinitely large p implies placing non-informative 
priors over the atoms {dn}, in which case the update of the 
dictionary is simplified as 

{D) = BA = Y{X'^){XX'^)-^ (26) 

This update formula is similar to the formula used for dic¬ 
tionary update in the MOD method, except with the point 
estimate X and XX^ replaced by the posterior mean {X) 
and (XX^), respectively. Nevertheless, unlike the MOD 
method which alternates between two separate stages (i.e. 
dictionary update and sparse coding), for our algorithm, the 
dictionary and the signal are refined in an interweaved and 
gradual manner, which enables the algorithm to come to a 
reasonably nearby point as the optimization progresses, and 
helps avoid undesirable local minima. This explains why our 
proposed method outperforms the MOD method. 

In the above algorithm, atoms are updated in a parallel 
way. By assuming posterior independence among atoms {dn}, 
our method can also be readily adapted to update atoms in a 
sequential manner, i.e. update one atom at a time while fixing 
the rest atoms in the dictionary. The mean field approximation. 
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in this case, can be expressed as 
p{e\y) ^q{x,a,D,'y) 

N 

=qx{x)qa{a)Y\_<idr,{dn)q'i{'y) (27) 

n=l 


provides better performance than the variational Bayesian 
inference. 

Let 6 = {X,a,L>, 7 } denote all hidden variables in our 
hierarchical model. We aim to find the posterior distribution 
of 9 given the observed data Y 


The posterior distribution qd^{dn) can then be computed by 
maximizing L{q) while keeping other hidden variables fixed 
using their most recent distributions, which leads to 


p{0\Y) ^p{Y\D,X,^)p{D)p{X\cx)p{cx)p{^) (33) 


To provide an approximation to the posterior distribution of the 
hidden variables, the Gibbs sampler generates an instance from 
In < 7 ^ (d i (xl\wr)(Y X \di\ n/ distribution of each hidden variable in turn, conditional 

^ on the current values of the other hidden variables. It can be 

(x{\np{Y\X, {dk},^)p{dn))q^(^x)Ylkj^n^dk{dk)q-fii) shown (see, for example, [ 20 ]) that the sequence of samples 


a:\lnp(l^ 

Si(7tr{(l^“^ - dnXn.){Y~'^ - dnXn.Y) 


+/3 

1 

'2 




where in (a), we define 


I , \ (rl W constitutes a Markov chain, and the stationary distribution of 

I niXn-^j)p[ n))q^(^x)Yl^^^qdj^idk)q ^Markov chain is just the sought-after joint distribution. 

Specifically, the sequential sampling procedure of the Gibbs 
sampler is given as follows. 

• Sampling X according to its conditional marginal distri- 
bution p{X\Y, a(*), 7 W); 

Sampling D according to its conditional marginal distri¬ 
bution p{D\Y, 7 ^^^); 

Sampling a according to its conditional marginal distri- 
bution p{a\Y,D^^+^\X^*+^\'y(-*'>y, 

Sampling 7 according to its conditional marginal distri- 


dn dji) 

dl{{^){Xn.xl)+p-^)-^dr 


A y _ 


-2dn{Y-^)(xl) 

(28) 

(29) 


in which is generated by D with the nth column of D 
replaced by a zero vector, and Xn- denotes the nth row of X, 
(b) comes from the fact that Y~^ — d^Xn- = W and thus we 
have 

p{Y-^\dn^Xn.,7) =P{Y~" - dnXn.) 

= exp - 2711^”" - dnXn- IIf j 

(30) 

From ([28]), it can be seen that dn follows a Gaussian distri¬ 
bution 


bution p( 7 |l^, a^^+i)). 

Note that the above sampling scheme is also referred to as a 
blocked Gibbs sampler [21] because it groups two or more 
variables together and samples from their joint distribution 
conditioned on all other variables, rather than sampling from 
each one individually. Details of this sampling scheme are 
provided below. For simplicity, the notation p{z\—) is used 
in the following to denote the distribution of variable 2 ; 
conditioned on all other variables. 

1). Sampling X\ The samples of X can be obtained 
by independently sampling each column of X, i.e. xi. The 
conditional marginal distribution of xi is given as 


qaMn) = ^f{dn\^ltK) ( 31 ) 

with the mean and the covariance matrix given respectively 
by 

yi =^i{Y-^){xl) 

K={{^){xr..xl) + rT^I (32) 

where {Xn-X^) is the nth diagonal element of (XX^), and 
= Y — {D~^){X). Our proposed algorithm therefore 
can be readily extended to a columnwise update procedure by 
replacing the update of qd{D) with the sequential update of 

Qdn (^n)? 


p{xi\-) (xp{Y\X,D,-f)p{xi\oLi) 

(X p{yi\D, xi,-f)p{xi\oLi) (34) 

Recalling ([T3]) . it can be easily verified that p{xi\—) follows 
a Gaussian distribution 

p{xi\-)=X{fif,J:f) (35) 

with its mean fif and covariance matrix given by 

yf = (36) 

Sf = (7D^D + A,)-^ (37) 


IV. Gibbs Sampler 

Gibbs sampling is an effective alternative to the variational 
Bayes method for Bayesian inference. In particular, different 
from the variational Bayes which provides a locally-optimal, 
exact analytical solution to an approximation of the posterior, 
Monte Carlo techniques such as Gibbs sampling provide a 
numerical approximation to the exact posterior of hidden vari¬ 
ables using a set of samples. It has been observed in a series 
of experiments (including our results) that the Gibbs sampler 


where Ai = diag(Q(i/,..., axi). 

2). Sampling D: There are two different ways to sample 
the dictionary: we can sample the whole set of atoms at once, 
or sample the atoms in a successive way. Here, in order to 
expedite the convergence of the Gibbs sampler, we sample the 
atoms of the dictionary in a sequential manner. The conditional 
distribution of dn can be written as 

p{dn\-)^p{dn)p{Y\D,X,j) 

^p{dn)p{Y-^\dn,Xn.,j) (38) 
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where is defined in (l29l) . Recalling (l30l) , we can show 
that the conditional distribution of dn follows a Gaussian 
distribution 


pK|-)=A/-(Mts;^) (39) 

with its mean and covariance matrix given by 

fit = (40) 

= ilXn.xl. + (41) 


3). Sampling ol\ The log-conditional distribution of ani can 
be computed as 


\a.p{ani\-) oc \\ip{anV,a,b)p{xni\ani) 



It is easy to verify that ani still follows a Gamma distribution 

p{ani\-) = Ga.mma{a,bni) (43) 

with the parameters a and bni given as 

a = a + I (44) 

bni=b + ^xh (45) 

4). Sampling 7: The log-conditional distribution of 7 is 
given by 

lnp(7|-) oc lnp(F|£),X,7)^(7) 

L 

oc lnP[p(y;|£),a;,,7)p(7) 
i=i 

= (^+c-l)ln7- Q||r-DX||| + d^7 

(46) 

from which we can arrive at 

p( 7 |—) = Gamma(c, d) (47) 


where 


c = a + 


ML 


d = d+-\\Y -DX\\l 


(48) 

(49) 


So far we have derived the conditional marginal distribu¬ 
tions for hidden variables {i),X,a, 7 }. Gibbs sampler suc¬ 
cessively generates the samples of these variables according 
to their conditional distributions. After a burn-in period, the 
generated samples can be viewed as samples drawn from the 
posterior distribution p(X, D, a, With those samples, 

the dictionary can be estimated by averaging the last few 
samples of the Gibbs sampler. For clarity, we now summarize 
the Gibbs sampling algorithm as follows. 


Sparse Bayesian Dictionary Learning - A Gibbs 
Sampling Algorithm 


T Given the current samples and Gen- 

erate a sample according to ( 1 ^ . 

2. Given the current samples and 7 ^^^ 

Generate a sample according to (1^ . 

3. Given the current samples and 7 ^^^ 

Generate a sample according to (l43l) . 

4. Given the current samples and 

Generate a sample 7 ^^+^) according to (l47l) . 

5. Repeat the above steps and collect the samples after 
a burn-in period. 


V. Simulation Results 

We now carry out experiments to illustrate the performance 
of our proposed sparse Bayesian dictionary learning (SBDL) 
methods, which are respectively referred to as SBDL-VB and 
SBDL-Gibbs. Throughout our experiments, the parameters for 
our proposed method are set equal to a = 0.5, b = 10“^, c = 
0.5, and d = 10“^. The parameter p is set to /3 = 10^ for the 
SBDL-VB and /3 = 1 for SBDL-Gibbs. Note that the SBDL- 
Gibbs is insensitive to the choice P and here we simply choose 
P = 1. We compare our proposed methods with other existing 
state-of-the-art dictionary learning methods, namely, the K- 
SVD algorithm [4], the atom parallel-updating (APrU-DL) 
method [10], and the Bata-Bernoulli process factor analysis 
(BPFA) method [11]. Both the synthetic data and real data are 
used to test the performance of respective algorithms. 

A. Synthetic Data 

We generate a dictionary D of size 20 x 50, with each entry 
independently drawn from a normal distribution. Columns of 
D are then normalized to unit norm. The training signals 
produced based on D, where each signal Pi is 
a linear combination of Ki randomly selected atoms and the 
weighting coefficients are i.i.d. normal random variables. Two 
different cases are considered. First, all training samples are 
generated with the same number of atoms, i.e. Ki = AT, V/, 
and K is assumed exactly known to the K-SVD method. 
The other case is that Ki varies from 3 to 6 for different I 
according to a uniform distribution. In this case, the K-SVD 
assumes that the sparsity level equals to 6 during the sparse 
coding stage. The observation noise is assumed multivariate 
Gaussian with zero mean and covariance matrix cr^J. Note 
that the APrU-DL (with FISTA) method requires to set two 
regularization parameters A and to control the tradeoff 
between the sparsity and the data fitting error. The selection 
of these two parameters is always a tricky issue and an 
inappropriate choice may lead to considerable performance 
degradation. To show this, we use the following two different 
choices: {A,As} = {0.2,0.15} and {A,As} = {0.4,0.4}, in 
which the former set of values are carefully selected to achieve 
the best performance, and the latter set of values slightly 
deviate from the former set of values. We use APrU-DL-F to 
denote the APrU-DL method which uses the former choice, 
and APrU-DL-L to denote the APrU-DL method which uses 
the latter one. For SBDL-Gibbs, the number of iterations is set 
to 300 and the estimate of the dictionary is simply chosen to be 
the last sample of the Gibbs sampler. For a fair comparison. 
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the competing algorithms including K-SVD, APrU-DL, and 
BPFA are executed sufficient numbers of iterations to achieve 
their best performance. 

The recovery success rate is used to evaluate the dictionary 
learning performance. The success rate is computed as the 
ratio of the number of successfully recovered atoms to the 
total number of atoms. An atom is considered successfully 
recovered if the distance between the original atom and the 
estimated atom is smaller than 0.01, where the distance is 
defined as 


where di denotes the estimated atom. Table I] shows the 
average recovery success rates of respective algorithms, where 
we set L = 1000 and L = 2000, respectively, and the signal- 
to-noise ratio (SNR) varies from 10 to lOOdB. Results are 
averaged over 50 independent trials. From Table Jl we can see 
that: 

• The proposed SBDL-Gibbs method achieves the high¬ 
est recovery success rates in most cases. The pro¬ 
posed SBDL-VB method, although not as well as the 
SBDL-Gibbs, still provides quite decent performance and 
presents a clear performance advantage over the K-SVD 
and APrU-DL methods when the number of training 
signals is limited, e.g. L = 1000. In particular, both the 
SBDL-Gibbs and the SBDL-VB outperform the BPFA 
method by a big margin, although all these three methods 
were developed in a Bayesian framework. 

• In the low SNR regime, e.g. SNR = lOdB, the K-SVD 
method suffers from a significant performance loss when 
there is a discrepancy between the presumed sparsity level 
and the groundtruth (see the case where Ki varies but the 
presumed sparsity level is fixed to 6). 

• The APrU-DL method is sensitive to the choice of the 
regularization parameters. It provides superior recovery 
performance when the regularization parameters are prop¬ 
erly selected. Nevertheless, as we can see from Table H 
the APrU-DL method incurs a considerable performance 
degradation when the parameters deviate from their opti¬ 
mal choice, and there is no general guideline suggesting 
how to choose appropriate values for these regularization 
parameters. 

B. Application To Image Denoising 

We now demonstrate the results by applying the above 
methods to image denoising. Suppose images are corrupted 
by white Gaussian noise with zero mean and variance . We 
partition a noise-corrupted image into a number of overlapping 
patches of size 8x8 pixels. Note that in our simulations, not 
all patches are selected for training, but only those patches 
whose top-left pixels are located at [r x 7,r x j] for any 
= 0 ,..., [(Q — 8)/rJ are selected, where Q denotes 
the dimension of the Q x Q image, and r is chosen to 
be r = {2,4}, respectively. The selected patches are then 
vectorized to generate the training signal {vi). Also, in our 
experiments, we assume that the noise variance is perfectly 


TABLE I 

Recovery Success Rates 


L 

SNR 

Algorithm 

K = 3 

K = 4 

K = 5 

Var. K 



K-SVD 

80.52 

36.36 

2.52 

0.80 



BPFA 

64.48 

38.00 

11.60 

26.56 


10 

APrU-DL-F 

85.64 

64.40 

33.44 

53.68 


APrU-DL-L 

48.20 

17.48 

4.68 

12.52 



SBDL-VB 

86.00 

63.84 

16.28 

47.48 



SBDL-Gibbs 

91.52 

62.48 

6.32 

41.80 



K-SVD 

93.20 

93.44 

92.08 

84.68 



BPFA 

83.20 

85.88 

85.00 

85.08 


20 

APrU-DL-F 

94.04 

93.32 

87.76 

93.48 


APrU-DL-L 

72.48 

40.32 

14.15 

33.04 



SBDL-VB 

97.28 

95.96 

92.32 

94.48 

1000 


SBDL-Gibbs 

99.64 

99.16 

97.52 

99.12 


K-SVD 

94.24 

94.32 

93.92 

86.64 



BPFA 

75.72 

80.68 

82.96 

81.56 


30 

APrU-DL-F 

94.24 

94.92 

88.16 

93.96 


APrU-DL-L 

73.40 

43.16 

17.16 

34.36 



SBDL-VB 

96.60 

96.16 

92.32 

95.48 



SBDL-Gibbs 

99.60 

99.16 

98.64 

99.00 



K-SVD 

94.24 

94.32 

93.92 

85.44 



BPFA 

75.88 

78.96 

82.16 

78.24 


100 

APrU-DL-F 

94.36 

93.84 

88.68 

93.64 


APrU-DL-L 

74.88 

44.72 

18.12 

36.08 



SBDL-VB 

97.20 

97.52 

92.24 

94.56 



SBDL-Gibbs 

99.32 

99.24 

98.24 

98.96 



K-SVD 

91.00 

88.88 

50.56 

25.32 



BPFA 

85.44 

82.84 

67.92 

81.24 


10 

APrU-DL-F 

97.00 

94.88 

86.24 

95.44 


APrU-DL-L 

84.84 

68.36 

42.28 

64.04 



SBDL-VB 

92.92 

81.80 

55.68 

77.16 



SBDL-Gibbs 

98.56 

95.72 

80.20 

93.88 



K-SVD 

95.64 

96.68 

95.16 

94.00 



BPFA 

84.44 

87.16 

88.48 

86.68 


20 

APrU-DL-F 

95.40 

96.48 

95.80 

96.56 


APrU-DL-L 

85.32 

82.44 

64.48 

79.84 



SBDL-VB 

97.64 

96.56 

92.12 

95.04 

2000 


SBDL-Gibbs 

99.48 

99.56 

98.92 

99.16 


K-SVD 

95.88 

96.92 

96.96 

93.36 



BPFA 

76.84 

81.00 

83.60 

81.64 


30 

APrU-DL-F 

94.28 

95.00 

96.80 

95.64 


APrU-DL-L 

86.32 

82.40 

66.08 

80.52 



SBDL-VB 

96.88 

96.96 

92.96 

94.96 



SBDL-Gibbs 

99.40 

99.16 

99.52 

99.32 



K-SVD 

96.04 

97.88 

96.88 

92.20 



BPFA 

75.24 

80.12 

83.00 

81.36 


100 

APrU-DL-F 

95.56 

95.48 

96.08 

96.00 


APrU-DL-L 

86.36 

82.92 

65.76 

79.56 



SBDL-VB 

96.84 

96.56 

94.32 

96.36 



SBDL-Gibbs 

99.40 

99.44 

99.24 

99.40 


known a priori by the K-SVD method. For the APrU-DL 
method, the regularization parameters A and Ag are carefully 
chosen to be A = 25 and A^ = 30. After the training by 
respective algorithms, the trained dictionary is then used for 
denoising. The denoising process involves a sparse coding 
of all patches (including those used for training and those 
not) of size 8x8 pixels from the noisy image. Due to its 
simplicity and fast execution, the orthogonal matching pursuit 
(OMP) method is employed to perform the sparse coding 
of all patches. The final estimate of each pixel is obtained 
by averaging the associated pixel from each of the denoised 
overlapping patches in which this pixel is included. 

Table nil shows the peak signal to noise ratio (PSNR) results 
obtained for different nature images by respective algorithms, 
where the noise standard deviation is set to a = {15, 25, 50}, 
respectively, and the dictionary to be inferred is assumed of 




























































TABLE II 
PSNR Results 


r 

(T 

Algorithm 

boat 

cameraman 

couple 



K-SVD 

29.2802 

31.4638 

31.4068 



BPFA 

29.2988 

30.8684 

31.0950 


15 

APrU-DL 

29.5718 

31.7662 

31.5304 



SBDL-VB 

29.3557 

31.1741 

31.0691 



SBDL-Gibbs 

29.5881 

31.6978 

31.4473 



K-SVD 

26.9308 

28.6211 

28.6949 



BPFA 

26.9576 

28.1639 

28.5825 

2 

25 

APrU-DL 

26.8998 

28.7069 

28.5378 



SBDL-VB 

26.6959 

28.1587 

28.4240 



SBDL-Gibbs 

27.1570 

28.8380 

28.8431 



K-SVD 

22.9499 

23.9898 

24.3532 



BPFA 

23.5059 

22.8861 

24.3181 


50 

APrU-DL 

22.7274 

23.5888 

24.1901 



SBDL-VB 

23.0861 

23.3194 

24.3299 



SBDL-Gibbs 

23.4651 

24.1899 

24.7870 



K-SVD 

29.2585 

31.3553 

31.3513 



BPFA 

28.8131 

29.7561 

30.3464 


15 

APrU-DL 

29.4554 

31.5541 

31.4276 



SBDL-VB 

29.3217 

31.0739 

31.1359 



SBDL-Gibbs 

29.5376 

31.4931 

31.5443 



K-SVD 

26.6756 

28.4350 

28.5580 



BPFA 

26.3727 

26.8885 

27.5139 

4 

25 

APrU-DL 

26.7240 

28.4447 

28.4097 



SBDL-VB 

26.5977 

28.0960 

28.3715 



SBDL-Gibbs 

27.0077 

28.5539 

28.7889 



K-SVD 

22.7708 

23.2908 

24.2388 



BPFA 

23.0100 

22.0422 

23.4463 


50 

APrU-DL 

22.6036 

23.3086 

24.1107 



SBDL-VB 

23.0404 

23.3315 

24.4163 



SBDL-Gibbs 

23.2525 

23.8610 

24.6326 


size 64 X 256. The PSNR is defined as 


PSNR = 20 logio 


/ 255 X \ 


where IJ and U denote the denoised image and the original 
image, respectively. From Table HH we see that the results 
of all methods are very close to each other in general. The 
proposed SBDL-Gibbs achieves a slightly higher PSNR than 
other methods in most cases, particularly when less number of 
signals is used for training. This result again demonstrates the 
superiority of the proposed method. In Fig.[2]and[3l we present 
the noise-corrupted images “cameraman” and “couple”, and 
the denoised images using dictionaries trained by our proposed 
algorithms. The trained dictionaries are also shown on the right 
sides of Fig. [2] and [S] 


VI. Conclusions 

We developed a new Bayesian hierarchical model for learn¬ 
ing the overcomplete dictionaries based on a set of training 
data. This new framework can be considered as an adaptation 
of the conventional sparse Beysian learning framework to deal 
with the dictionary learning problem. Specifically, a Gaussian- 
inverse Gamma hierarchical prior is used to promote the 
sparsity of the representation. Suitable priors are also placed 
on the dictionary and the noise variance such that they can be 
reasonably inferred from the data. We developed a variational 
Bayesian method and a Gibbs sampler for Bayesian inference. 
Unlike some of previous methods, the proposed methods 
do not need to assume knowledge of the noise variance a 


priori, and can infer the noise variance automatically from the 
data. The performance of the proposed methods is evaluated 
using synthetic data. Numerical results show that the proposed 
methods are able to learn the dictionary with an accuracy 
considerably better than existing methods, particularly for the 
case where there is a limited number of training signals. The 
proposed methods are also applied to image denoising, where 
superior denoising results are achieved even compared to other 
state-of-the-art algorithms. Our proposed hierarchical model 
is also flexible to incorporate additional prior information to 
enhance the dictionary learning performance. 
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Fig. 2. Example of the denoising results for the image “Cameraman” (cr = 25 and r=2). (From left to right) The corrupted image, the denoised image by 
SBDL-VB (28.1587dB), the denoised image by SBDL-Gibbs (28.8380dB), the dictionary trained by SBDL-VB, the dictionary trained by SBDL-Gibbs. 



Fig. 3. Example of the denoising results for the image “Couple” (cr = 25 and r=2). (From left to right) The corrupted image, the denoised image by 
SBDL-VB (28.4240dB), the denoised image by SBDL-Gibbs (28.843IdB), the dictionary trained by SBDL-VB, the dictionary trained by SBDL-Gibbs. 
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